The goal of this project is to investigate how partnerships involving multiple top-tier players in the NBA impacts various performance measures and team outcomes. Among the research questions we would like to explore are the following:

[ INSERT RESEARCH QUESTIONS HERE ]

To be able to investigate, we need to pull data from multiple NBA seasons. The script below provides code to create functions that pull traditional stats for every player for a given user-defined season.

Load necessary libraries

library(rvest)
library(dplyr)
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

Function to get NBA roster for a specified year

get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}

Example usage

year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)
NA
#Summary statistics

position_roster<-filter(nba_roster,Pos!="PG" )
position_roster

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
#Data Visualization for Minutes Played vs Points Scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))

# Get the input values.
input <- nba_roster[, c('MP', 'PTS')]

# Plot the chart for cars with
# weight between 1.5 to 4 and
# mileage between 10 and 25.
plot(x = input$MP, y = input$PTS,
    xlab = "Minutes Played",
    ylab = "Points",
    xlim = c(0.0, 48),
    ylim = c(0.0, 35),   
    main = "Minutes Played vs Points Scored"
)

#Data Visualization for Field Goals Attempled vs Field Goals Made
input_2 <- nba_roster[, c('FGA', 'FG')]
print(head(input_2))

# Get the input values.
input_2 <- nba_roster[, c('FGA', 'FG')]

# Plot the chart for players with
# field goal attempts between 0.0 to 25.0 and
b_FG<-max(input_2$FG,na.rm=T)
b_FGA<-max(input_2$FGA,na.rm=T)
# Field Goal Made between 0.0 and 25.0
plot(x = input_2$FGA, y = input_2$FG,
    xlab = "Field Goal Attempts",
    ylab = "Field Goal Made",
    xlim = c(0.0, b_FGA),
    ylim = c(0.0, b_FG),     
    main = "Field Goal Attempt vs Field Goal Made"
)

NA
NA
# Create the data for the chart
A <- c(nba_roster$PTS)
B <- c("PF", "PG", "SF", "C", "SG")

# Plot the bar chart
ggplot(nba_roster, aes(x=Pos, fill=PTS))+  

geom_bar()+  

theme_classic(16)+  

xlab("Position")+  

ylab("Points") 
Warning: The following aesthetics were dropped during statistical transformation: fill.
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

ASSIGNMENT 1: Is the data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,

Initial thought to change the default character to double given that we have fractioned values. i think the columns should be changed from ” chararcter” to “double”

For players who are missing data

nba_roster<-na.omit(nba_roster)

nba_roster




# Convert specific columns from character to double
# Convert all character columns to double
nba_roster %>%
   mutate(across(G:PTS, as.numeric))
NA
NA
NA

To determine whether a player is “top tier” and should be considered a part of a “Big 3” lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS

We will consider advanced statistics such as PLAYER EFFIFIENCY RATING:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT - TO) /GP

In particular, Value over Replacement (VORP) seems to do a solid job of identifying the best players in the league.

The script below provide code to create functions that pull advanced stats for every player for a given user-defined season.


# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)
NA
NA

ASSIGNMENT 2: Is the advanced data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,

cleaning similar to first one

The script below provide code to clean out the quality issues presented in the dataframe

#1 We want to order the athletes name to alphabetical order to clean out the filler headers present

newdataframe<- dataframe[order(dataframe$Player)]
Error: object 'dataframe' not found

#want to order by alphabetic name to make cleaning out the filler headers from the dataset
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]



# remove filler rows that had been previously used as headers on webpage
AO_nba_advanced_stats<- AO_nba_advanced_stats[-c(502:526), ]


#remove na from dataframe
AO_nba_advanced_stats %>% 
  select(where(~!all(is.na(.))))
#removing column 20 and 25 from dataframe since theyre blanks
AO_nba_advanced_stats<-AO_nba_advanced_stats[,-20]
AO_nba_advanced_stats<-AO_nba_advanced_stats[,-24]


#change range of cloumns <dbl> from <chr>

AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))
Warning: There were 22 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(G:VORP, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 21 remaining warnings.

ASSIGNMENT 3: Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.

#how to merge two files into one new data frame
nba_merge<-merge(nba_roster, AO_nba_advanced_stats, by = c("Rk", "Player", "Pos","Age", "Tm","G"))#, by.y = c("Rk", "Player", "Pos","Age", "Tm","G") , all.x = TRUE, all.y = TRUE)
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

ASSIGNMENT 4: Make a function with argument year that outputs one dataframe with the merged traditional and advanced data.


combined_nba_stats<-function(year){
get_nba_roster2 <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}
  
  year <- 2023  # Specify the year
nba_roster2 <- get_nba_roster2(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)


#take out the N/A 
nba_roster2<-na.omit(nba_roster2)


# Convert specific columns from character to double

nba_roster2 %>%
   mutate(across(G:PTS, as.numeric))

#ADVANCED STATS

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2023  # Specify the year
nba_advanced_stats2<- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats2)



#want to order by alphabetic name to make cleaning out the filler headers from the dataset
AO_nba_advanced_stats2<- nba_advanced_stats2[order(nba_advanced_stats2$Player),]





#remove na from dataframe
AO_nba_advanced_stats2 %>% 
  select(where(~!all(is.na(.))))
#removing column 20 and 25 from dataframe since theyre blanks
AO_nba_advanced_stats2<-AO_nba_advanced_stats2[,-20]
AO_nba_advanced_stats2<-AO_nba_advanced_stats2[,-24]


# remove filler rows that had been previously used as headers on webpage


AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[AO_nba_advanced_stats2$Player != "Player",]

AO_nba_advanced_stats2$Player <- factor(AO_nba_advanced_stats2$Player)




#change range of cloumns <dbl> from <chr>

AO_nba_advanced_stats2 %>%
   mutate(across(G:VORP, as.numeric))
   
   nba_merge<-merge(nba_roster2, AO_nba_advanced_stats2, by.x = c("Rk", "Player", "Pos","Age", "Tm","G"), by.y = c("Rk", "Player", "Pos","Age", "Tm","G") , all.x = TRUE, all.y = TRUE)
   
}

ASSIGNMENT 5: Make this file more visually appealng, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.

---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships involving multiple top-tier players in the NBA impacts various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To be able to investigate, we need to pull data from multiple NBA seasons. The script below provides code to create functions that pull traditional stats for every player for a given user-defined season.

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
```

Function to get NBA roster for a specified year
```{r}
get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}
```

Example usage
```{r}
year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)

```
```{r}
#Summary statistics

position_roster<-filter(nba_roster,Pos!="PG" )
position_roster
```



```{r}

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig


```
```{r}
#Data Visualization for Minutes Played vs Points Scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))

# Get the input values.
input <- nba_roster[, c('MP', 'PTS')]

# Plot the chart for cars with
# weight between 1.5 to 4 and
# mileage between 10 and 25.
plot(x = input$MP, y = input$PTS,
	xlab = "Minutes Played",
	ylab = "Points",
	xlim = c(0.0, 48),
	ylim = c(0.0, 35),	 
	main = "Minutes Played vs Points Scored"
)

```


```{r}
#Data Visualization for Field Goals Attempled vs Field Goals Made
input_2 <- nba_roster[, c('FGA', 'FG')]
print(head(input_2))

# Get the input values.
input_2 <- nba_roster[, c('FGA', 'FG')]

# Plot the chart for players with
# field goal attempts between 0.0 to 25.0 and
b_FG<-max(input_2$FG,na.rm=T)
b_FGA<-max(input_2$FGA,na.rm=T)
# Field Goal Made between 0.0 and 25.0
plot(x = input_2$FGA, y = input_2$FG,
	xlab = "Field Goal Attempts",
	ylab = "Field Goal Made",
	xlim = c(0.0, b_FGA),
	ylim = c(0.0, b_FG),	 
	main = "Field Goal Attempt vs Field Goal Made"
)


```



```{r}
# Create the data for the chart
A <- c(nba_roster$PTS)
B <- c("PF", "PG", "SF", "C", "SG")

# Plot the bar chart
ggplot(nba_roster, aes(x=Pos, fill=PTS))+  

geom_bar()+  

theme_classic(16)+  

xlab("Position")+  

ylab("Points") 

```


**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
Initial thought to change the default character to double given that we have fractioned values. i think the columns should be changed from " chararcter" to "double" 


 - *b. justify the validity of your approach*
removing observations with missing data from the dataset, using the function "na.omit" which will remove rows with missing values from our dataset


 - *c. implement your proposed changes*


For players who are missing data
```{r}
nba_roster<-na.omit(nba_roster)

nba_roster




# Convert specific columns from character to double
# Convert all character columns to double
nba_roster %>%
   mutate(across(G:PTS, as.numeric))



```

To determine whether a player is "top tier" and should be considered a part of a "Big 3" lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS 

We will consider advanced statistics such as PLAYER EFFIFIENCY RATING:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT - TO) /GP

In particular, Value over Replacement (VORP) seems to do a solid job of identifying the best players in the league.

The script below provide code to create functions that pull advanced stats for every player for a given user-defined season.
```{r}

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)


```

**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 cleaning similar to first one 
 
 
 The script below provide code to clean out the quality issues presented in the dataframe
 
 
 
```{r}
#1 We want to order the athletes name to alphabetical order to clean out the filler headers present

newdataframe<- dataframe[order(dataframe$Player)]

#2 Now we want to remove the filler rows that had been used as headers on the webpage

newdata.frame<-dataframe[-c(502:526), ]

#3 now we want to remove all the N/As from the dataset
dataframe %>% 
  select(where(~!all(is.na(.))))
```
 
 
 
```{r}

#want to order by alphabetic name to make cleaning out the filler headers from the dataset
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]



# remove filler rows that had been previously used as headers on webpage
AO_nba_advanced_stats<- AO_nba_advanced_stats[-c(502:526), ]


#remove na from dataframe
AO_nba_advanced_stats %>% 
  select(where(~!all(is.na(.))))
#removing column 20 and 25 from dataframe since theyre blanks
AO_nba_advanced_stats<-AO_nba_advanced_stats[,-20]
AO_nba_advanced_stats<-AO_nba_advanced_stats[,-24]


#change range of cloumns <dbl> from <chr>

AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))




```


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
#how to merge two files into one new data frame
nba_merge<-merge(nba_roster, AO_nba_advanced_stats, by.x = c("Rk", "Player", "Pos","Age", "Tm","G"), by.y = c("Rk", "Player", "Pos","Age", "Tm","G") , all.x = TRUE, all.y = TRUE)


head(nba_merge)
```


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 

```{r}

combined_nba_stats<-function(year){
get_nba_roster2 <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}
  
  year <- 2023  # Specify the year
nba_roster2 <- get_nba_roster2(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)


#take out the N/A 
nba_roster2<-na.omit(nba_roster2)


# Convert specific columns from character to double

nba_roster2 %>%
   mutate(across(G:PTS, as.numeric))

#ADVANCED STATS

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2023  # Specify the year
nba_advanced_stats2<- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats2)



#want to order by alphabetic name to make cleaning out the filler headers from the dataset
AO_nba_advanced_stats2<- nba_advanced_stats2[order(nba_advanced_stats2$Player),]





#remove na from dataframe
AO_nba_advanced_stats2 %>% 
  select(where(~!all(is.na(.))))
#removing column 20 and 25 from dataframe since theyre blanks
AO_nba_advanced_stats2<-AO_nba_advanced_stats2[,-20]
AO_nba_advanced_stats2<-AO_nba_advanced_stats2[,-24]


# remove filler rows that had been previously used as headers on webpage


AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[AO_nba_advanced_stats2$Player != "Player",]

AO_nba_advanced_stats2$Player <- factor(AO_nba_advanced_stats2$Player)




#change range of cloumns <dbl> from <chr>

AO_nba_advanced_stats2 %>%
   mutate(across(G:VORP, as.numeric))
   
   nba_merge<-merge(nba_roster2, AO_nba_advanced_stats2, by.x = c("Rk", "Player", "Pos","Age", "Tm","G"), by.y = c("Rk", "Player", "Pos","Age", "Tm","G") , all.x = TRUE, all.y = TRUE)
   
}



```


**ASSIGNMENT 5:** *Make this file more visually appealng, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.*


